Multivariate Time Series Models: ARIMAX/SARIMAX/VAR
After fitting univariate time series models, it is clear that greater complexity may be needed to fit models that better understand the time series data. This section will include the fitting of five different multivariate time series models, including ARIMAX and VAR models. The best parameters for each model will be chosen by comparing RMSE values. This will provide insight into which factors have the greatest influence on public transit ridership according to the data, which is crucial for addressing many of the questions identified on earlier pages. From this analysis, we hope to use exogenous variables as additional information for forecasting.
Note: Much of the code and process on this page is re-purposed from:
Gamage, P. (2026). Applied Time Series for Data Science.
Article 1: Telework and the day-to-day variability of travel behaviour: The specificities of Fridays
This article details the phenomenon of telework increasing drastically in the wake of COVID-19, and the externalities of this change. Specifically, telework has great impact on both the environment and public transportation, both of which are considerations for assessing the efficacy of remote work. The authors suggest that the prevalence of telework poses a threat to the economic viability of public transit systems. Thus, this section will evaluate the relationship between public transit usage and the employment landscape to address the challenges outlined in the article.
Article 2: Modeling car ownership and use in a developing country context with informal public transportation
This article evaluates the ability for public transportation to exist in a society that is heavily reliant on cars. As we know, transitions from private automobile transportation to public transportation can result in many advantages for a municipality. However, the authors surmise that “only if major improvements to the public transportation services are enacted would a decrease in car ownership and usage be achieved” in societies that are already so car-focused. Therefore, this section will attempt to investigate the relationship between people’s decisions to operate cars, use rideshare services, or take public transportation to evaluate the ability for these methods to coexist.
In this model, we will evaluate the relationships between rideshare usage, vehicle miles, and public transit ridership. The purpose of this is to see the interaction between three of the most common options for transportation, and how each of these time series can be used as predictors for the others. Ultimately, consumers often have choices between utilizing these three options, or are inclined to use one over the others due to convenience, price, or speed. This model should help us understand how they relate.
Variables
VAR: Rideshare Usage, Vehicle-Miles, Public Transit Ridership
Question: How do the usages of different forms of transportation interact with one another?
Based on the correlation matrices of residuals, using \(p=1\) appears to be more effective.
Cross Validation
Code
n <-length(combined_scaled_ts[,1])k <-floor(n /2/2) *2h <-4n_k <- n - kq_steps <- n_k / hrmse1 <-matrix(NA, n_k, 3)rmse2 <-matrix(NA, n_k, 3)st <-tsp(combined_scaled_ts)[1] + (k -1) /4# Cross-validation loop for time series forecastingfor (i in1:q_steps) { # Define training and testing windows xtrain <-window(combined_scaled_ts, end = st + i -1) xtest <-window(combined_scaled_ts, start = st + (i -1) +1/4, end = st + i)######## First Model (VAR with p=1) ######## fit1 <- vars::VAR(xtrain, p =1, type ="both") fcast1 <-predict(fit1, n.ahead =2) f_miles <- fcast1$fcst$miles_ts_trimmed[, 1] f_uber <- fcast1$fcst$uber_ts_trimmed[, 1] f_ridership <- fcast1$fcst$ridership_ts_trimmed[, 1] ff1 <-data.frame(f_miles, f_uber, f_ridership) year <- st + (i -1) +1/4 ff1 <-ts(ff1, start =c(year, 1), frequency =4) a <-min(4* i -3) b <-min(4* i) rmse1[c(a:b), ] <-sqrt((ff1 - xtest) ^2) fit2 <- vars::VAR(xtrain, p =5, type ="both") fcast2 <-predict(fit2, n.ahead =2) f_miles2 <- fcast2$fcst$miles_ts_trimmed[, 1] f_uber2 <- fcast2$fcst$uber_ts_trimmed[, 1] f_ridership2 <- fcast2$fcst$ridership_ts_trimmed[, 1] ff2 <-data.frame(f_miles2, f_uber2, f_ridership2) year <- st + (i -1) +1/4 ff2 <-ts(ff2, start =c(year, 1), frequency =4) a <-min(4* i -3) b <-min(4* i) rmse2[c(a:b), ] <-sqrt((ff2 - xtest) ^2)}print(paste("Model 1 (p=1): ", mean(rmse1, na.rm =TRUE)))
Based on these RMSE values, it appears that \(p=5\) is the best model.
Chosen Model
My chosen model is VAR(5).
Forecasting
Code
library(patchwork)fit <-VAR(combined_ts, p =5, type ="both")fit_pr <-predict(fit, n.ahead =10, ci =0.95)actual_data <-as.data.frame(combined_ts)actual_data$Time <-as.numeric(time(combined_ts))f_ridership <- fit_pr$fcst$ridership_ts_trimmed[, "fcst"]f_miles <- fit_pr$fcst$miles_ts_trimmed[, "fcst"]f_uber <- fit_pr$fcst$uber_ts_trimmed[, "fcst"]ridership_lower <- fit_pr$fcst$ridership_ts_trimmed[, "lower"]ridership_upper <- fit_pr$fcst$ridership_ts_trimmed[, "upper"]miles_lower <- fit_pr$fcst$miles_ts_trimmed[, "lower"]miles_upper <- fit_pr$fcst$miles_ts_trimmed[, "upper"]uber_lower <- fit_pr$fcst$uber_ts_trimmed[, "lower"]uber_upper <- fit_pr$fcst$uber_ts_trimmed[, "upper"]# Create future time points as numerictime_index <-as.numeric(seq(from =end(combined_ts)[1] +1, length.out =10))# Create data frames for forecastsforecast_ridership <-data.frame(Time = time_index, Forecast = f_ridership, Lower = ridership_lower, Upper = ridership_upper)forecast_miles <-data.frame(Time = time_index, Forecast = f_miles, Lower = miles_lower, Upper = miles_upper)forecast_uber <-data.frame(Time = time_index, Forecast = f_uber, Lower = uber_lower, Upper = uber_upper)p1 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = ridership_ts_trimmed), color ="blue") +geom_ribbon(data = forecast_ridership, aes(x = Time, ymin = Lower, ymax = Upper), fill ="blue", alpha =0.2) +geom_line(data = forecast_ridership, aes(x = Time, y = Forecast), color ="blue", linetype ="dashed") +labs(title ="Ridership Forecast", x ="Time", y ="Public Transit Ridership") +theme_minimal()p2 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = miles_ts_trimmed), color ="red") +geom_ribbon(data = forecast_miles, aes(x = Time, ymin = Lower, ymax = Upper), fill ="red", alpha =0.2) +geom_line(data = forecast_miles, aes(x = Time, y = Forecast), color ="red", linetype ="dashed") +labs(title ="Vehicle Miles Forecast", x ="Time", y ="Vehicle Miles") +theme_minimal()p3 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = uber_ts_trimmed), color ="green") +geom_ribbon(data = forecast_uber, aes(x = Time, ymin = Lower, ymax = Upper), fill ="green", alpha =0.2) +geom_line(data = forecast_uber, aes(x = Time, y = Forecast), color ="green", linetype ="dashed") +labs(title ="Rideshare Users Forecast", x ="Time", y ="Rideshare Users") +theme_minimal()# Arrange plots using patchwork(p1 / p2) / p3
These forecasts show that both public transit ridership and total automobile miles driven are expected to slightly increase in the short-term, prior to a downturn. Meanwhile, rideshare users are expected to stagnate. It is reasonable to assume that given the small timeframe given in training, as well as COVID being responsible for a great deal of the existing data, this is not particularly convincing. Additionally, the 95% confidence bands do little to indicate that these are particularly confident forecasts.
Model 2:
In this model, we will evaluate the relationships between all transit types, of which there are six in the provided data. The purpose of this is to see the interaction between the common types of public transportation, and how each of these time series can be used as predictors for the others. This should help us understand whether certain forms of transportation trend differently, and which factors may influence that.
Variables
VAR: Ridership of all Transit Types
Question: Is there a relationship between the ridership of different transit types (bus, rail, etc.)?
Model Fitting
Code
heavy_rail_ts <-ts(ridership$`Heavy Rail (000s)`, start=c(1992,1), frequency =4)light_rail_ts <-ts(ridership$`Light Rail (000s)`, start=c(1992,1), frequency =4)commuter_rail_ts <-ts(ridership$`Commuter Rail (000s)`, start=c(1992,1), frequency =4)trolleybus_ts <-ts(ridership$`Trolleybus (000s)`, start=c(1992,1), frequency =4)bus_ts <-ts(ridership$`Bus (000s)`, start=c(1992,1), frequency =4)demand_response_ts <-ts(ridership$`Demand Response (000s)`, start=c(1992,1), frequency =4)
Cross validation confirms that \(p=1\) does produce better results.
Chosen Model
My chosen model is VAR(1).
Forecasting
Code
fit <-VAR(modes_ts, p =1, type ="both")fit_pr <-predict(fit, n.ahead =12, ci =0.95)actual_data <-as.data.frame(modes_ts)actual_data$Time <-as.numeric(time(modes_ts))f_heavy <- fit_pr$fcst$heavy_rail_ts[, "fcst"]f_light <- fit_pr$fcst$light_rail_ts[, "fcst"]f_commuter <- fit_pr$fcst$commuter_rail_ts[, "fcst"]f_trolley <- fit_pr$fcst$trolleybus_ts[, "fcst"]f_bus <- fit_pr$fcst$bus_ts[, "fcst"]f_demand <- fit_pr$fcst$demand_response_ts[, "fcst"]heavy_lower <- fit_pr$fcst$heavy_rail_ts[, "lower"]heavy_upper <- fit_pr$fcst$heavy_rail_ts[, "upper"]light_lower <- fit_pr$fcst$light_rail_ts[, "lower"]light_upper <- fit_pr$fcst$light_rail_ts[, "upper"]commuter_lower <- fit_pr$fcst$commuter_rail_ts[, "lower"]commuter_upper <- fit_pr$fcst$commuter_rail_ts[, "upper"]trolley_lower <- fit_pr$fcst$trolleybus_ts[, "lower"]trolley_upper <- fit_pr$fcst$trolleybus_ts[, "upper"]bus_lower <- fit_pr$fcst$bus_ts[, "lower"]bus_upper <- fit_pr$fcst$bus_ts[, "upper"]demand_lower <- fit_pr$fcst$demand_response_ts[, "lower"]demand_upper <- fit_pr$fcst$demand_response_ts[, "upper"]# Create future time points as numerictime_index <-as.numeric(seq(from =end(modes_ts)[1] +1, length.out =12))# Create data frames for forecastsforecast_heavy <-data.frame(Time = time_index, Forecast = f_heavy, Lower = heavy_lower, Upper = heavy_upper)forecast_light <-data.frame(Time = time_index, Forecast = f_light, Lower = light_lower, Upper = light_upper)forecast_commuter <-data.frame(Time = time_index, Forecast = f_commuter, Lower = commuter_lower, Upper = commuter_upper)forecast_trolley <-data.frame(Time = time_index, Forecast = f_trolley, Lower = trolley_lower, Upper = trolley_upper)forecast_bus <-data.frame(Time = time_index, Forecast = f_bus, Lower = bus_lower, Upper = bus_upper)forecast_demand <-data.frame(Time = time_index, Forecast = f_demand, Lower = demand_lower, Upper = demand_upper)p1 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = heavy_rail_ts), color ="blue") +geom_ribbon(data = forecast_heavy, aes(x = Time, ymin = Lower, ymax = Upper), fill ="blue", alpha =0.2) +geom_line(data = forecast_heavy, aes(x = Time, y = Forecast), color ="blue", linetype ="dashed") +labs(title ="Heavy Rail Forecast", x ="Year", y ="Heavy Rail Ridership") +theme_minimal()p2 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = light_rail_ts), color ="red") +geom_ribbon(data = forecast_light, aes(x = Time, ymin = Lower, ymax = Upper), fill ="red", alpha =0.2) +geom_line(data = forecast_light, aes(x = Time, y = Forecast), color ="red", linetype ="dashed") +labs(title ="Light Rail Forecast", x ="Year", y ="Light Rail Ridership") +theme_minimal()p3 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = commuter_rail_ts), color ="green") +geom_ribbon(data = forecast_commuter, aes(x = Time, ymin = Lower, ymax = Upper), fill ="green", alpha =0.2) +geom_line(data = forecast_commuter, aes(x = Time, y = Forecast), color ="green", linetype ="dashed") +labs(title ="Commuter Rail Forecast", x ="Year", y ="Commuter Rail Ridership") +theme_minimal()p4 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = trolleybus_ts), color ="purple") +geom_ribbon(data = forecast_trolley, aes(x = Time, ymin = Lower, ymax = Upper), fill ="purple", alpha =0.2) +geom_line(data = forecast_trolley, aes(x = Time, y = Forecast), color ="purple", linetype ="dashed") +labs(title ="Trolleybus Forecast", x ="Year", y ="Trolleybus Ridership") +theme_minimal()p5 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = bus_ts), color ="orange") +geom_ribbon(data = forecast_bus, aes(x = Time, ymin = Lower, ymax = Upper), fill ="orange", alpha =0.2) +geom_line(data = forecast_bus, aes(x = Time, y = Forecast), color ="orange", linetype ="dashed") +labs(title ="Bus Forecast", x ="Year", y ="Bus Ridership") +theme_minimal()p6 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = demand_response_ts), color ="brown") +geom_ribbon(data = forecast_demand, aes(x = Time, ymin = Lower, ymax = Upper), fill ="brown", alpha =0.2) +geom_line(data = forecast_demand, aes(x = Time, y = Forecast), color ="brown", linetype ="dashed") +labs(title ="Demand Response Forecast", x ="Year", y ="Demand Response Ridership") +theme_minimal()# Arrange plots using patchwork(p1 / p2) / p3
Code
(p4 / p5) / p6
These forecasts all tell roughly the same thing: we can expect a slight decline in the coming years for each mode of public transportation. The 95% confidence bands indicate that these are somewhat confident predictions, although may be misinformed due to a great deal of the movement in recent years coming from the COVID-19 pandemic. From these models, we can conclude that the modes are roughly correlated with one another, although we do see variation especially from trolleybus and buses which appear more correlated with each other than the remaining modes.
Model 3:
In this model, we will attempt to create statistics and forecasts that can be measured against univariate models fit against public transit ridership at a national level. In this case, the exogenous variables will be vehicle miles, gas prices, and unemployment rate. The purpose of using these is to contextualize public transit with other transportation options, as well as demand of transportation. Vehicle miles and gas prices serve as proxies for people’s inclination to travel via car, while unemployment rate relates to the number of people who may need to commute. These three exogenous variables are gathered at monthly rates, so averaging will be used to get quarterly values that match ridership data.
Variables
ARIMAX: Ridership ~ Vehicle Miles + Gas Prices + Unemployment Rate
Question: To what extent is public transit ridership impacted by larger macroeconomic phenomena?
p1 <-plot_ly(ridership, x =~Date, y =~`Total Ridership (000s)`, type ='scatter', mode ='lines', name ="Passenger Trips") %>%layout(yaxis =list(title ="Trips"))p2 <-plot_ly(ridership, x =~Date, y =~vehicle_miles, type ='scatter', mode ='lines', name ="Vehicle Miles") %>%layout(yaxis =list(title ="Miles"))p3 <-plot_ly(ridership, x =~Date, y =~gas_prices, type ='scatter', mode ='lines', name ="Gas Prices") %>%layout(yaxis =list(title ="Price"))p4 <-plot_ly(ridership, x =~Date, y =~unemployment_rate, type ='scatter', mode ='lines', name ="Unemployment Rate") %>%layout(yaxis =list(title ="Rate"))final_plot <-subplot(p1, p2, p3, p4, nrows =4, shareX =TRUE, titleX =TRUE) %>%layout(title ="Time Series of Several Variables",yaxis =list(title ="Trips"), yaxis2 =list(title ="Miles"), yaxis3 =list(title ="Price"),yaxis4 =list(title ="Rate"),xaxis =list(title ="Date") )final_plot
This plot shows that each of these variables were affected by the events in 2020, to varying degrees, and have had time to settle in the years since. The relationships are not immediately obvious, but those should be uncovered by fitting a model.
Series: y
Regression with ARIMA(0,1,0)(1,0,1)[4] errors
Coefficients:
sar1 sma1 vehicle_miles gas_prices unemployment_rate
0.9581 -0.7228 1.3404 25464.91 -127842.779
s.e. 0.0371 0.1132 0.2178 25081.27 7045.816
sigma^2 = 4.499e+09: log likelihood = -1540.7
AIC=3093.4 AICc=3094.13 BIC=3110.27
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -1841.28 65428.32 49521.02 -0.2359171 2.539576 0.3710027 0.1905822
Code
checkresiduals(fit)
Ljung-Box test
data: Residuals from Regression with ARIMA(0,1,0)(1,0,1)[4] errors
Q* = 30.425, df = 6, p-value = 3.263e-05
Model df: 2. Total lags used: 8
Code
res.fit <-ts(residuals(fit), start =decimal_date(as.Date("1993-04-01", format ="%Y-%m-%d")), frequency =4)
Code
SARIMA.c <-function(p_range, d_range, q_range, P_range, D_range, Q_range, data) { s <-12 results_list <-list()for (p in p_range) {for (d in d_range) {for (q in q_range) {for (P in P_range) {for (D in D_range) {for (Q in Q_range) {if (p + d + q + P + D + Q <=10) {tryCatch({# Fit SARIMA model model <-Arima(data, order =c(p, d, q), seasonal =c(P, D, Q))# Store results results_list[[length(results_list) +1]] <-c(p, d, q, P, D, Q, model$aic, model$bic, model$aicc) }, error =function(e) {# Handle errors without breaking the loopcat("Model failed for (p=", p, ", d=", d, ", q=", q, ", P=", P, ", D=", D, ", Q=", Q, ")\n") }) } } } } } } } results_df <-as.data.frame(do.call(rbind, results_list))colnames(results_df) <-c("p", "d", "q", "P", "D", "Q", "AIC", "BIC", "AICc")return(results_df)}output <-SARIMA.c(p_range =0:3, q_range =0:3, d_range =0:1, D_range =0:1, P_range =0, Q_range =1:2, data = res.fit)minaic <- output[which.min(output$AIC), ]minbic <- output[which.min(output$BIC), ]print(minaic)
p d q P D Q AIC BIC AICc
56 1 1 1 0 1 2 2983.791 2997.687 2984.322
Series: y
Regression with ARIMA(1,1,1)(0,1,2)[4] errors
Coefficients:
ar1 ma1 sma1 sma2 vehicle_miles gas_prices
0.8483 -0.5294 -0.9724 0.1422 1.0509 53043.62
s.e. 0.0858 0.1181 0.1147 0.1153 0.2443 22979.42
unemployment_rate
-132741.347
s.e. 5789.683
sigma^2 = 3.824e+09: log likelihood = -1480.4
AIC=2976.8 AICc=2978.11 BIC=2999.03
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 1707.84 58767.78 44100.25 0.0480734 2.223013 0.3303912 -0.08269953
Chosen Model:ARIMA(1,1,1)(0,1,2)[4]
RMSE:\(58,767.78\)
Compared to Univariate RMSE:\(156,042.40\)
Equation:\[y_t = 0.8483\, y_{t-1}
- 0.5294\, \epsilon_{t-1}
- 0.9724\, \epsilon_{t-12}
+ 0.1422\, \epsilon_{t-24}
+ 1.0509\, x_{1,t}
+ 53043.62\, x_{2,t}
- 132741.347\, x_{3,t}
+ \epsilon_t\], where \(x_1\) is vehicle miles, \(x_2\) is gas prices, and \(x_3\) is unemployment rate.
Forecasting
Code
fcast <-forecast(fit, xreg = fxreg, h =20)
Warning in forecast.forecast_ARIMA(fit, xreg = fxreg, h = 20): xreg contains
different column names from the xreg used in training. Please check that the
regressors are in the same order.
Code
autoplot(fcast) +ggtitle("Forecast of Ridership for the Next 20 Quarters (5 Years)") +xlab("Year") +ylab("Total Ridership") +theme_minimal()
In contrast with what we got from univariate modeling, this forecast predicts that ridership will gradually increase over the next five years, while maintaining the same seasonality we’ve been seeing. Due to the significantly lower RMSE value here, this is a more reliable prediction, despite still having noticeably imprecise confidence bands. This tells us that the exogenous variables are indeed useful to include as predictors in addition to past ridership values. What’s also notable is that this upward trend is clearly less steep than what we’ve seen in the post-pandemic time window, indicating a prediction that the recovery will somewhat ease up.
Model 4:
This model will use the same variables as the previous one, but instead use vehicle miles as the target variable with public transit ridership as an exogenous variable. This will allow us to consider a different transportation method as a predictor, and see if people’s inclination to drive a car is impacted by the same factors. A plot in the previous section shows these variables concurrently to display any immediate relationships that may be visible.
Variables
ARIMAX: Vehicle Miles ~ Public Transit Ridership + Gas Prices + Unemployment Rate
Question: To what extent is automobile usage impacted by larger macroeconomic phenomena?
Series: y2
Regression with ARIMA(1,1,0)(2,0,0)[4] errors
Coefficients:
ar1 sar1 sar2 Total Ridership (000s) gas_prices
0.6747 -0.4669 -0.1504 0.0191 1867.510
s.e. 0.0759 0.1118 0.1023 0.0124 4881.099
unemployment_rate
-4081.588
s.e. 1914.553
sigma^2 = 412609788: log likelihood = -1392.14
AIC=2798.28 AICc=2799.26 BIC=2817.97
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 4112.976 19731.13 10151.7 0.1495027 0.348495 0.1724709 -0.1209944
The chosen model for this is ARIMA(1,1,0)(2,0,0)[4]
RMSE: 19731.13
Compared to univariate RMSE: 156042.4
Equation:\[y_t = 0.6747\, y_{t-1}
- 0.4669\, y_{t-12}
- 0.1504\, y_{t-24}
+ 0.0191\, x_{1,t}
+ 1867.510\, x_{2,t}
- 4081.588\, x_{3,t}
+ \epsilon_t\] where \(x_1\) is public transit ridership, \(x_2\) is gas prices, and \(x_3\) is unemployment rate.
Forecasting
Code
fcast2 <-forecast(fit2, xreg = fxreg2, h =20)autoplot(fcast2) +ggtitle("Forecast of Vehicle Miles for the Next 20 Quarters (5 Years)") +xlab("Year") +ylab("Million Vehicle Miles") +theme_minimal()
This forecast is a bit less telling than that for public transit ridership. We simply see a constant line that extends without much variation over the next five years with very wide confidence bands. However, the RMSE is significantly less than that of our univariate model, indicating that the inclusion of exogenous variables is indeed useful.
Model 5:
In this model, we will evaluate the relationships between the top five cities in the U.S. in public transit ridership. The purpose of this is to compare trends within cities interact with one another, and whether some cities are predictive of others.
Variables
VAR: Ridership of top 5 cities in the U.S.
Question: To what extent are trends in some U.S. metropolitan areas predictive of others?
Model Fitting
Code
cities <-read_csv("../data/monthly_data.csv")cities$Month <-as.Date(cities$Month, format ="%m/%d/%y")cities <- cities[order(cities$Month),]nyc_ts <-ts(cities$`UPT - New York--Jersey City--Newark, NY--NJ`, start=c(2002,1), frequency =12)la_ts <-ts(cities$`UPT - Los Angeles--Long Beach--Anaheim, CA`, start=c(2002,1), frequency =12)chi_ts <-ts(cities$`UPT - Chicago, IL--IN`, start=c(2002,1), frequency =12)sf_ts <-ts(cities$`UPT - San Francisco--Oakland, CA`, start=c(2002,1), frequency =12)dc_ts <-ts(cities$`UPT - Washington--Arlington, DC--VA--MD`, start=c(2002,1), frequency =12)
My chosen model is VAR(1) based on these RMSE values.
Forecasting
Code
fit <-VAR(combined_ts, p =1, type ="both")fit_pr <-predict(fit, n.ahead =5, ci =0.95)actual_data <-as.data.frame(combined_ts)actual_data$Time <-as.numeric(time(combined_ts))f_nyc <- fit_pr$fcst$nyc_ts[, "fcst"]f_la <- fit_pr$fcst$la_ts[, "fcst"]f_chi <- fit_pr$fcst$chi_ts[, "fcst"]f_sf <- fit_pr$fcst$sf_ts[, "fcst"]f_dc <- fit_pr$fcst$dc_ts[, "fcst"]nyc_lower <- fit_pr$fcst$nyc_ts[, "lower"]nyc_upper <- fit_pr$fcst$nyc_ts[, "upper"]la_lower <- fit_pr$fcst$la_ts[, "lower"]la_upper <- fit_pr$fcst$la_ts[, "upper"]chi_lower <- fit_pr$fcst$chi_ts[, "lower"]chi_upper <- fit_pr$fcst$chi_ts[, "upper"]sf_lower <- fit_pr$fcst$sf_ts[, "lower"]sf_upper <- fit_pr$fcst$sf_ts[, "upper"]dc_lower <- fit_pr$fcst$dc_ts[, "lower"]dc_upper <- fit_pr$fcst$dc_ts[, "upper"]# Create future time points as numerictime_index <-as.numeric(seq(from =end(combined_ts)[1] +1, length.out =5))# Create data frames for forecastsforecast_nyc <-data.frame(Time = time_index, Forecast = f_nyc, Lower = nyc_lower, Upper = nyc_upper)forecast_la <-data.frame(Time = time_index, Forecast = f_la, Lower = la_lower, Upper = la_upper)forecast_chi <-data.frame(Time = time_index, Forecast = f_chi, Lower = chi_lower, Upper = chi_upper)forecast_sf <-data.frame(Time = time_index, Forecast = f_sf, Lower = sf_lower, Upper = sf_upper)forecast_dc <-data.frame(Time = time_index, Forecast = f_dc, Lower = dc_lower, Upper = dc_upper)p1 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = nyc_ts), color ="blue") +geom_ribbon(data = forecast_nyc, aes(x = Time, ymin = Lower, ymax = Upper), fill ="blue", alpha =0.2) +geom_line(data = forecast_nyc, aes(x = Time, y = Forecast), color ="blue", linetype ="dashed") +labs(title ="New York, NY Forecast", x ="Year", y ="New York, NY Ridership") +theme_minimal()p2 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = la_ts), color ="red") +geom_ribbon(data = forecast_la, aes(x = Time, ymin = Lower, ymax = Upper), fill ="red", alpha =0.2) +geom_line(data = forecast_la, aes(x = Time, y = Forecast), color ="red", linetype ="dashed") +labs(title ="Los Angeles, CA Forecast", x ="Year", y ="Los Angeles, CA Ridership") +theme_minimal()p3 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = chi_ts), color ="green") +geom_ribbon(data = forecast_chi, aes(x = Time, ymin = Lower, ymax = Upper), fill ="green", alpha =0.2) +geom_line(data = forecast_chi, aes(x = Time, y = Forecast), color ="green", linetype ="dashed") +labs(title ="Chicago, IL Forecast", x ="Year", y ="Chicago, IL Ridership") +theme_minimal()p4 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = sf_ts), color ="purple") +geom_ribbon(data = forecast_sf, aes(x = Time, ymin = Lower, ymax = Upper), fill ="purple", alpha =0.2) +geom_line(data = forecast_sf, aes(x = Time, y = Forecast), color ="purple", linetype ="dashed") +labs(title ="San Francisco, CA Forecast", x ="Year", y ="San Francisco, CA Ridership") +theme_minimal()p5 <-ggplot() +geom_line(data = actual_data, aes(x = Time, y = dc_ts), color ="orange") +geom_ribbon(data = forecast_dc, aes(x = Time, ymin = Lower, ymax = Upper), fill ="orange", alpha =0.2) +geom_line(data = forecast_dc, aes(x = Time, y = Forecast), color ="orange", linetype ="dashed") +labs(title ="Washington, DC Forecast", x ="Year", y ="Washington, DC Ridership") +theme_minimal()# Arrange plots using patchworkp1 / p2
Code
p3 / p4
Code
p5
Much like the other VAR models, this is not a particularly convincing forecast. In all cases except for Chicago, we are seeing a gradually decreasing slope with any seasonality lost. Confidence bands tell us that these predictions are imprecise. It is interesting to see the disparities in the predictions, indicating that there are differences to be identified between these cities. This will be an interesting phenomenon to further explore in the interrupted time series section.